
    Provenance-based Reconciliation In Conflicting Data

    Data fusion is the process of resolving conflicting data from multiple data sources. As the data sources are inherently heterogeneous, an expert is needed to resolve the conflicting data. The traditional approach requires the expert to resolve a considerable number of conflicts in order to acquire a high-quality dataset. In this project, we consider how to acquire a high-quality dataset while keeping the expert effort minimal. First, we achieve this goal by building a model that leverages the provenance of the data when reconciling conflicting data. Second, we improve our model by taking the dependencies between data sources into account. Finally, we empirically show that our solution significantly reduces the user effort while obtaining a dataset of high quality compared with the traditional method.
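    As an illustration only (not the project's actual model), the sketch below resolves one conflicting attribute by provenance-weighted voting and discounts sources believed to copy from others; the source names, reliability scores and dependency map are hypothetical.

```python
# A minimal sketch of provenance-weighted conflict resolution (illustration,
# not the project's model): each claimed value is scored by the reliability of
# the sources providing it, and dependent (copying) sources are discounted so
# they do not vote twice. All inputs below are hypothetical.
from collections import defaultdict

def resolve(claims, reliability, copies_from=None):
    """claims: list of (source, value); reliability: source -> [0, 1];
    copies_from: optional map from a source to the source it copies."""
    copies_from = copies_from or {}
    scores = defaultdict(float)
    for source, value in claims:
        weight = reliability.get(source, 0.5)
        if source in copies_from:          # dependent source: discount its vote
            weight *= 0.3
        scores[value] += weight
    return max(scores, key=scores.get), dict(scores)

claims = [("A", "Paris"), ("B", "Paris"), ("C", "Lyon")]
reliability = {"A": 0.9, "B": 0.6, "C": 0.7}
print(resolve(claims, reliability, copies_from={"B": "A"}))
```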

    Crowdsourcing Literature Review

    Our user feedback framework requires robust techniques to tackle the scalability issue of the schema matching network. One approach is to employ crowdsourcing/human computation models. Crowdsourcing is a cutting-edge research area in which human workers perform pre-defined tasks. In this literature review, we explore concepts such as tasks, workflows, feedback aggregation, quality control and reward systems. We show that many of these aspects can be integrated into our user feedback framework.
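    As a minimal illustration of one concept surveyed here, the sketch below aggregates redundant crowd answers for a single task by majority vote; the worker identifiers and labels are hypothetical, and the review itself covers richer schemes such as quality-controlled weighting.

```python
# A minimal sketch of feedback aggregation by simple majority voting over
# redundant crowd answers; worker ids and labels are hypothetical.
from collections import Counter

def aggregate(answers):
    """answers: list of (worker_id, label) for one task -> majority label."""
    counts = Counter(label for _, label in answers)
    label, _ = counts.most_common(1)[0]
    return label

print(aggregate([("w1", "match"), ("w2", "match"), ("w3", "no-match")]))
```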

    Fighting Rumours on Social Media

    With the advance of social platforms, people are sharing content at an unprecedented scale. This makes social platforms an ideal place for spreading rumors. As rumors may have negative impacts on the real world, many rumor detection techniques have been proposed. In this proposal, we summarize several works that focus on two important steps of rumor detection. The first step detects controversial events in the data streams, which are candidates for rumors. The aim of the second step is to find the truth values of these events, i.e., whether they are rumors or not. Although some techniques achieve state-of-the-art results, they do not cope well with the streaming nature of social platforms. In addition, they usually leverage only one type of information available on social platforms, such as the posts alone. To overcome these limitations, we propose two research directions that emphasize 1) detecting rumors in a progressive manner and 2) combining different types of information for better detection.
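    The sketch below illustrates the progressive direction under simplifying assumptions (it is not the proposal's algorithm): an event becomes a rumor candidate once enough posts have arrived and the supporting and denying stances are sufficiently controversial. The thresholds, stance labels and event identifiers are hypothetical.

```python
# A minimal sketch of progressive rumor-candidate detection over a post stream
# (assumption, not the proposal's algorithm): keep running stance counts per
# event and flag an event once volume is high and stances are near-balanced.
from collections import defaultdict

class ProgressiveRumorDetector:
    def __init__(self, min_posts=50, controversy_threshold=0.4):
        self.counts = defaultdict(lambda: {"support": 0, "deny": 0})
        self.min_posts = min_posts
        self.threshold = controversy_threshold

    def observe(self, event_id, stance):
        """stance is 'support' or 'deny' for one incoming post."""
        self.counts[event_id][stance] += 1
        c = self.counts[event_id]
        total = c["support"] + c["deny"]
        if total < self.min_posts:
            return False
        controversy = min(c["support"], c["deny"]) / total   # 0.5 = evenly split
        return controversy >= self.threshold

detector = ProgressiveRumorDetector(min_posts=3, controversy_threshold=0.3)
for stance in ["support", "deny", "deny", "support"]:
    print(detector.observe("event-42", stance))
```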

    Investigating Graph Embedding Methods for Cross-Platform Binary Code Similarity Detection

    IoT devices are increasingly present, both in industry and in consumer markets, but their security remains weak, which leads to an unprecedented number of attacks against them. In order to reduce the attack surface, one approach is to analyze the binary code of these devices to detect potential security vulnerabilities early. More specifically, given a known vulnerable function, we can determine whether the firmware of an IoT device contains a security flaw by searching for this function. However, searching for similar vulnerable functions is in general challenging because the source code is often not openly available and because the code can be compiled for different architectures, using different compilers and compilation settings. To handle these varying settings, we can compare the similarity between graph embeddings derived from the binary functions. In this paper, inspired by recent advances in deep learning, we propose a new method, GESS (graph embeddings for similarity search), to derive graph embeddings, and we compare it with various state-of-the-art methods. Our empirical evaluation shows that GESS reaches an AUC of 0.979, thereby outperforming the best known approach. Furthermore, for a fixed low false positive rate, GESS provides a true positive rate (or recall) about 36% higher than the best previous approach. Finally, for a large search space, GESS provides a recall between 50% and 60% higher than the best previous approach.
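    The sketch below illustrates only the search step, under the assumption that fixed-size embeddings have already been computed by some embedding network (GESS itself is not reproduced here): candidate functions are ranked by cosine similarity to the known vulnerable function, with random vectors standing in for real embeddings.

```python
# A minimal sketch of embedding-based similarity search (the search step only;
# the embedding network itself is not shown). Random vectors stand in for the
# real function embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, candidate_embeddings, top_k=5):
    """Return the indices and scores of the top_k most similar candidates."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), scores[i]) for i in order]

rng = np.random.default_rng(0)
vulnerable = rng.normal(size=64)                    # embedding of the known vulnerable function
firmware_functions = rng.normal(size=(1000, 64))    # embeddings of candidate functions
print(search(vulnerable, firmware_functions, top_k=3))
```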

    A comparison of network embedding approaches

    Network embedding automatically learns to encode a graph into multi-dimensional vectors. The embedded representation appears to outperform hand-crafted features in many downstream machine learning tasks. A plethora of network embedding approaches have been proposed over the last decade, building on the advances and successes of deep learning. However, there is no absolute winner, as the network structure varies from application to application and the notion of connection in a graph carries its own semantics in different domains. In this report, we compare different network embedding approaches on real and synthetic datasets covering different graph structures. Although our prototype currently includes only two network embedding techniques, it can easily be extended thanks to our systematic evaluation methodology and available source code.
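    The sketch below illustrates the kind of plug-in evaluation harness described here, under simplifying assumptions: each embedding method is a function from a graph to one vector per node, and all methods are scored on the same downstream node-classification task. The two baseline methods shown are placeholders, not the techniques used in the report.

```python
# A minimal sketch of a pluggable comparison harness (assumption: a
# simplification of the report's prototype). The two embedding methods are
# placeholder baselines; new methods can be added to the dict.
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def spectral_embedding(G, dim=8):
    # truncated SVD of the adjacency matrix as a simple spectral embedding
    A = nx.to_numpy_array(G)
    U, S, _ = np.linalg.svd(A)
    return U[:, :dim] * S[:dim]

def degree_embedding(G, dim=8):
    # trivial baseline: node degree repeated, standing in for a second method
    degrees = np.array([d for _, d in G.degree()], dtype=float)
    return np.tile(degrees[:, None], (1, dim))

def evaluate(G, labels, methods):
    # score every embedding method on the same node-classification task
    for name, method in methods.items():
        X = method(G)
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=3)
        print(f"{name}: accuracy {scores.mean():.2f}")

G = nx.karate_club_graph()
labels = [G.nodes[n]["club"] for n in G.nodes]
evaluate(G, labels, {"spectral": spectral_embedding, "degree": degree_embedding})
```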

    Towards Enabling Probabilistic Databases for Participatory Sensing

    Participatory sensing has emerged as a new data collection paradigm, in which humans use their own devices (cell phone accelerometers, cameras, etc.) as sensors. This paradigm makes it possible to collect a huge amount of data from the crowd for world-wide applications, without spending money on dedicated sensors. Despite this benefit, the data collected from human sensors are inherently uncertain, since the participants offer no quality guarantees. Moreover, participatory sensing data are time series that not only exhibit highly irregular dependencies on time, but also vary from sensor to sensor. To overcome these issues, we study in this paper the problem of creating probabilistic data from (uncertain) time series collected by participatory sensors. We approach the problem in two steps. In the first step, we generate probabilistic time series from raw time series using a dynamical model from the time series literature. In the second step, we combine probabilistic time series from multiple sensors based on the mutual relationship between the reliability of the sensors and the quality of their data. Through extensive experimentation, we demonstrate the efficiency of our approach on both real and synthetic data.
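    The sketch below illustrates the two-step idea under strong simplifications: step 1 uses a basic random-walk Kalman filter rather than the dynamical model mentioned above, and step 2 fuses sensors with fixed reliability weights rather than the mutual reliability/quality estimation; all numbers are synthetic.

```python
# A minimal two-step sketch (simplified assumptions, not the paper's model):
# step 1 turns a raw series into (mean, variance) per step with a random-walk
# Kalman filter; step 2 fuses sensors by reliability and inverse variance.
import numpy as np

def probabilistic_series(raw, process_var=0.1, obs_var=1.0):
    """Step 1: turn one raw series into (mean, variance) per time step."""
    mean, var, out = raw[0], obs_var, []
    for z in raw:
        var += process_var                      # predict (random-walk model)
        k = var / (var + obs_var)               # Kalman gain
        mean, var = mean + k * (z - mean), (1 - k) * var
        out.append((mean, var))
    return out

def fuse(series_per_sensor, reliability):
    """Step 2: combine sensors, weighting by reliability and inverse variance."""
    fused = []
    for step in zip(*series_per_sensor):
        weights = np.array([r / v for (m, v), r in zip(step, reliability)])
        means = np.array([m for m, _ in step])
        fused.append(float(np.sum(weights * means) / np.sum(weights)))
    return fused

rng = np.random.default_rng(1)
truth = np.sin(np.linspace(0, 3, 20))
sensors = [truth + rng.normal(scale=s, size=20) for s in (0.2, 0.8)]
series = [probabilistic_series(s) for s in sensors]
print(fuse(series, reliability=[0.9, 0.4])[:5])
```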

    Probabilistic Schema Covering

    Schema covering is the process of representing large and complex schemas by easily comprehensible common objects. This task is done by identifying a set of common concepts from a repository, called the concept repository, and generating a cover that describes the schema in terms of these concepts. The traditional schema covering approach has two shortcomings: it does not model the uncertainty in the covering process, and it requires the user to state an ambiguity constraint, which is hard to define. We remedy these problems by incorporating a probabilistic model into schema covering to generate a probabilistic schema cover. The integrated probabilities not only enhance the coverage of the cover results but also eliminate the need to define the ambiguity parameter. Both probabilistic schema covering and traditional schema covering run on top of a concept repository. Experiments on real datasets show the competitive performance of our approach.
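    The sketch below illustrates covering a schema with concepts under probabilistic match scores, using a greedy expected-coverage heuristic that is only an assumption, not the paper's actual model; the schema attributes and concept repository are hypothetical.

```python
# A minimal sketch of probabilistic schema covering via a greedy
# expected-coverage heuristic (assumption, not the paper's model).
def probabilistic_cover(schema_attrs, concepts, min_gain=0.5):
    """concepts: name -> {attribute: probability that the concept covers it}.
    Greedily pick concepts while they add enough expected coverage."""
    covered, cover = {a: 0.0 for a in schema_attrs}, []
    while True:
        best, best_gain = None, min_gain
        for name, probs in concepts.items():
            gain = sum(max(0.0, p - covered.get(a, 0.0))
                       for a, p in probs.items() if a in covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            return cover
        cover.append(best)
        for a, p in concepts.pop(best).items():
            if a in covered:
                covered[a] = max(covered[a], p)

schema = ["name", "street", "city", "zip", "price"]
concept_repository = {
    "Address": {"street": 0.9, "city": 0.8, "zip": 0.85},
    "Person":  {"name": 0.95},
    "Offer":   {"price": 0.7, "name": 0.4},
}
print(probabilistic_cover(schema, concept_repository))
```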